Logistics & Announcement


Part 1: Q3 on Exam 2

  1. (Bonus 3pts) Consider the prediction equation: \[\hat Y = a + bX\] where \(a\) and \(b\) are the intercept and slope coefficients respectively.

Suppose we construct two new variables \(Y^*\) and \(X^*\):

\[Y^* = 10 \cdot Y + 2\] \[X^* = 0.1 \cdot X\]

Suppose our sample data remain the same, but we now regress \(Y^*\) on \(X^*\) and obtain the following prediction equation: \[\hat Y^* = a^* + b^*X^*\]

How will the new intercept (\(a^*\)) and slope (\(b^*\)) relate to the original \(a\) and \(b\)? (i.e., I’m asking you to express \(a^*\) and \(b^*\) as functions of \(a\) and \(b\).)

Method 1: Express \(X\) and \(Y\) using \(X^*\) and \(Y^*\):

Based on the question, we can express \(X\) and \(Y\) in terms of \(X^*\) and \(Y^*\) by inverting the definitions of \(X^*\) and \(Y^*\):

\[Y = \frac{Y^* -2}{10} \] \[X = 10\cdot X^*\] Then \(Y = a + bX\) becomes:

\[\frac{Y^* -2}{10} = a + b\cdot 10\cdot X^*\] Arrange this to the form of \(\hat Y^* = a^* + b^*X^*\):

\[Y^* -2 = 10 \cdot (a + b\cdot 10\cdot X^*)\] \[Y^* = 10 \cdot (a + b\cdot 10\cdot X^*) + 2\] \[Y^* = 10a + 100b\cdot X^* + 2\] Now, gather the intercept part of the above equation, i.e., the terms that do not depend on \(X^*\): \(10a\) and \(+2\). Then, gather the slope part, i.e., the term that does depend on \(X^*\): \(100b\cdot X^*\):

\[Y^* = (10a + 2) + 100 b\cdot X^*\] The above equation has the form \(\hat Y^* = a^* + b^*X^*\), with \(a^* = 10a + 2\) and \(b^* = 100b\).

Method 2: Express \(X^*\) and \(Y^*\) using \(X\) and \(Y\); this gives expressions for \(a\) and \(b\) in terms of \(a^*\) and \(b^*\), which you can then invert:

First, start with the equation \(\hat Y^* = a^* + b^*X^*\). Given that \(Y^* = 10Y + 2\) and \(X^* = 0.1X\), we can write \(\hat Y^* = a^* + b^*X^*\) as:

\[10Y + 2 = a^* + b^*\cdot 0.1X\] \[10Y = a^* + 0.1 b^*\cdot X - 2\] \[Y = \frac{a^* + 0.1 b^*\cdot X - 2}{10}\] \[Y = \frac{a^*}{10} + 0.01 b^*\cdot X - 0.2\]

Again, gather the intercept and the slope terms:

\[Y = (\frac{a^*}{10} - 0.2) + 0.01b^*\cdot X \] Since we know \(X\) and \(Y\) are related by \(Y = a + bX\), we now have:

\[a = (\frac{a^*}{10} - 0.2)\] \[b = 0.01\cdot b^*\] Then solve these two equations for \(a^*\) in terms of \(a\) and for \(b^*\) in terms of \(b\):

\[\frac{a^*}{10} - 0.2 = a\] \[\frac{a^*}{10} = a + 0.2\] \[a^* = 10a + 2\] And for \(b^*\):

\[b^* = 100b\]
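Both methods give \(a^* = 10a + 2\) and \(b^* = 100b\). As a quick numerical sanity check (not part of the exam answer), you can verify this in R with simulated data; the seed and true coefficients below are arbitrary:

```r
# Simulate data, fit both regressions, and check a* = 10a + 2, b* = 100b
set.seed(1)                                  # arbitrary seed
x <- rnorm(100)
y <- 3 + 2 * x + rnorm(100)

fit_orig  <- lm(y ~ x)                       # regress Y on X
fit_trans <- lm(I(10 * y + 2) ~ I(0.1 * x))  # regress Y* on X*

a <- coef(fit_orig)[[1]];  b <- coef(fit_orig)[[2]]
a_star <- coef(fit_trans)[[1]]; b_star <- coef(fit_trans)[[2]]

all.equal(a_star, 10 * a + 2)   # TRUE
all.equal(b_star, 100 * b)      # TRUE
```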


Now we continue last week’s topic on visualizing the results of a regression model. Today we will learn two more plots: the coefficient plot and the predicted-effects plot.

First, load packages to your environment.

knitr::opts_chunk$set(echo = TRUE)

library(pacman)
# p_load() installs any missing packages before loading them
p_load(tidyverse, stargazer, kableExtra, coefplot)

Second, load the data and the model we ran last week.

# Load cleaned and recoded df
load("data/earnings_df.RData")

# Examine data
head(earnings_df, 10) %>% kbl("html") %>% kable_classic_2(full_width = F)
earn      edu  age_recode  female  black  other
14.67010    3          28       1      0      1
71.91367    7          53       0      0      0
54.62454    7          41       0      0      0
64.54969    7          49       0      0      0
31.36042    9          41       1      0      0
56.65916    7          63       0      0      0
26.21015    1          60       0      1      0
52.45393    4          55       0      0      0
37.90477    7          23       1      1      0
46.28573    6          49       1      0      1
# Estimate Nested Models
m1 <- lm(earn ~ age_recode, 
         data = earnings_df)

m2 <- lm(earn ~ age_recode + edu,
         data = earnings_df)

m3 <- lm(earn ~ age_recode + edu + female,
         data = earnings_df)

m4 <- lm(earn ~ age_recode + edu + female + black + other,
         data = earnings_df)

m5 <- lm(earn ~ age_recode + edu + female + black + other + edu*female,
         data = earnings_df)

# Examine models
stargazer(m1, m2, m3, m4, m5, type="text", omit.stat=c("ser", "f"))
## 
## ================================================================
##                              Dependent variable:                
##              ---------------------------------------------------
##                                     earn                        
##                 (1)       (2)       (3)        (4)        (5)   
## ----------------------------------------------------------------
## age_recode   0.134***  0.132***   0.160***   0.158***  0.156*** 
##               (0.042)   (0.033)   (0.022)    (0.022)    (0.019) 
##                                                                 
## edu                    4.314***   4.500***   4.477***  6.083*** 
##                         (0.171)   (0.112)    (0.112)    (0.143) 
##                                                                 
## female                           -20.528*** -20.572***  -1.571  
##                                   (0.568)    (0.565)    (1.311) 
##                                                                 
## black                                       -2.307***  -2.385***
##                                              (0.623)    (0.557) 
##                                                                 
## other                                         -0.767    -0.946  
##                                              (1.137)    (1.017) 
##                                                                 
## edu:female                                             -3.128***
##                                                         (0.199) 
##                                                                 
## Constant     43.888*** 17.786*** 25.439***  26.429***  16.974***
##               (1.917)   (1.817)   (1.207)    (1.230)    (1.254) 
##                                                                 
## ----------------------------------------------------------------
## Observations    980       980       980        980        980   
## R2             0.010     0.400     0.744      0.747      0.798  
## Adjusted R2    0.009     0.399     0.743      0.746      0.797  
## ================================================================
## Note:                                *p<0.1; **p<0.05; ***p<0.01

Part 2: Coefficient Plots

A coefficient plot visualizes the regression coefficients with their confidence intervals. You can create one easily using coefplot() from the coefplot package. There are also other packages that visualize coefficients.

# Default coefficient plot
coefplot(m5)

# Remove the intercept from the plot
coefplot(m5, intercept = F)

# The default innerCI is 1, which is 1 se around the point estimate
# the default outerCI is 2, which is 2 se around the point estimate
# You can set both to 1.96, which is the 95% confidence interval of betas
coefplot(m5, intercept = F, innerCI = 1.96, outerCI = 1.96)

# Or only keep the outerCI = 1.96
coefplot(m5, intercept = T, innerCI = F, outerCI = 1.96)

# You can also change the color, shape, and size of the texts
# as well as change plot titles and axes labels
# read the documentation for more
coefplot(m5, intercept = F, innerCI = F, outerCI = 1.96, 
         color = "black",                         # customize color
         title = "Coefficient Plot for Model 5")  # customize title 
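If you want full control over the appearance, you can also build a coefficient plot by hand from confint() and ggplot2 (loaded via tidyverse). This is a sketch of what coefplot() does under the hood:

```r
# Hand-rolled coefficient plot: point estimates with 95% CIs
est <- coef(m5)[-1]                     # drop the intercept
ci  <- confint(m5, level = 0.95)[-1, ]  # matching 95% CIs

coef_df <- data.frame(term     = names(est),
                      estimate = unname(est),
                      lwr      = ci[, 1],
                      upr      = ci[, 2])

ggplot(coef_df, aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = lwr, xmax = upr), height = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  theme_classic() +
  labs(x = "Estimate (95% CI)", y = "",
       title = "Coefficient Plot for Model 5")
```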

Part 3: Plot Effect for Regression Models

# First, create a dataframe containing all predictor variables,
# in which only the key IV varies
pred_IV <- tibble(edu = rep(0:15, 2)) %>%          # first, create a df with values of your key IV
  mutate(female = c(rep(0, 16), rep(1, 16)),       # b/c we are looking at interaction effects,
                                                   # give gender two values; otherwise fix it at its mean
         age_recode = mean(earnings_df$age_recode, na.rm = T),   # fix other variables at their means
         black = mean(earnings_df$black),
         other = mean(earnings_df$other))
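An equivalent, more compact way to build the same grid is tidyr::crossing() (loaded with tidyverse), which enumerates all combinations of the varying predictors. Note that crossing() sorts its output, so the row order may differ from pred_IV above even though the rows are the same:

```r
# Same prediction grid via crossing(): one row per edu-by-female combination
pred_IV_alt <- crossing(edu = 0:15, female = c(0, 1)) %>%
  mutate(age_recode = mean(earnings_df$age_recode, na.rm = TRUE),
         black = mean(earnings_df$black),
         other = mean(earnings_df$other))
```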


# Examine the df
head(pred_IV, 20) %>% kbl("html") %>% kable_classic_2(full_width = F)
edu  female  age_recode  black  other
  0       0    43.26429  0.306  0.068
  1       0    43.26429  0.306  0.068
  2       0    43.26429  0.306  0.068
  3       0    43.26429  0.306  0.068
  4       0    43.26429  0.306  0.068
  5       0    43.26429  0.306  0.068
  6       0    43.26429  0.306  0.068
  7       0    43.26429  0.306  0.068
  8       0    43.26429  0.306  0.068
  9       0    43.26429  0.306  0.068
 10       0    43.26429  0.306  0.068
 11       0    43.26429  0.306  0.068
 12       0    43.26429  0.306  0.068
 13       0    43.26429  0.306  0.068
 14       0    43.26429  0.306  0.068
 15       0    43.26429  0.306  0.068
  0       1    43.26429  0.306  0.068
  1       1    43.26429  0.306  0.068
  2       1    43.26429  0.306  0.068
  3       1    43.26429  0.306  0.068

Now that we have the dataframe pred_IV ready for predicting the dependent variable (earnings), we can use the R function predict() to calculate fitted earnings using the regression model and the values specified in each row of pred_IV. Then, use cbind() to combine this fitted-Y vector with pred_IV for plotting.

# use `predict` to predict the Y
predicted_earning <- predict(m5,                      # the model you are using
                             pred_IV,                # the df you use for predicting
                             interval = "confidence", # set CI
                             level = 0.95)

# bind the columns
pred_result <- cbind(pred_IV, predicted_earning)

# check df
head(pred_result, 20) %>% kbl("html") %>% kable_classic_2(full_width = F)
edu  female  age_recode  black  other        fit        lwr        upr
  0       0    43.26429  0.306  0.068   22.93127   21.12317   24.73936
  1       0    43.26429  0.306  0.068   29.01392   27.46140   30.56644
  2       0    43.26429  0.306  0.068   35.09657   33.78940   36.40374
  3       0    43.26429  0.306  0.068   41.17922   40.10017   42.25827
  4       0    43.26429  0.306  0.068   47.26188   46.38025   48.14350
  5       0    43.26429  0.306  0.068   53.34453   52.60462   54.08443
  6       0    43.26429  0.306  0.068   59.42718   58.73804   60.11632
  7       0    43.26429  0.306  0.068   65.50983   64.76173   66.25793
  8       0    43.26429  0.306  0.068   71.59248   70.69714   72.48783
  9       0    43.26429  0.306  0.068   77.67514   76.57927   78.77100
 10       0    43.26429  0.306  0.068   83.75779   82.43210   85.08348
 11       0    43.26429  0.306  0.068   89.84044   88.26842   91.41246
 12       0    43.26429  0.306  0.068   95.92309   94.09489   97.75130
 13       0    43.26429  0.306  0.068  102.00575   99.91513  104.09636
 14       0    43.26429  0.306  0.068  108.08840  105.73122  110.44558
 15       0    43.26429  0.306  0.068  114.17105  111.54442  116.79768
  0       1    43.26429  0.306  0.068   21.36018   19.52665   23.19371
  1       1    43.26429  0.306  0.068   24.31464   22.72939   25.89990
  2       1    43.26429  0.306  0.068   27.26911   25.92251   28.61571
  3       1    43.26429  0.306  0.068   30.22357   29.09985   31.34729
# Plot
pred_result %>% 
  mutate(gender = ifelse(female == 0, "Male", "Female")) %>%       # Convert the dummy to a character variable
  ggplot(aes(x = edu, y = fit)) +
  geom_line(aes(linetype = gender)) +                              # group linetype by gender
  geom_ribbon(aes(ymin = lwr, ymax = upr, fill = gender), alpha = 0.3) +   # add 95% CI
  theme_classic() +
  labs(x = "Years of Education",
       y = "Predicted Earnings") +
  ggtitle("Predicted Earnings by Education and Gender",
          subtitle = "(Modeled with interaction between education and gender)")


Part 3 Exercise

Plot the effect of age on earn according to Model 5. Post your plot on Slack.

# Your code here

Part 4: F-test for Nested Models

We can use an F-test to compare two nested regression models. The idea behind the F-test for nested models is to check how much error is reduced after adding additional predictors. A relatively large reduction in error yields a large F statistic and a small p-value. The p-value for the F statistic is the right-tail probability.

If the F test’s p-value is significant (smaller than 0.05 for most social science studies), it means that at least one of the additional \(\beta_j\) in the full model is not equal to zero.
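Formally, if the restricted model has predictors \(X_1, \dots, X_k\) and the full model adds \(q\) more, the hypotheses are:

\[H_0: \beta_{k+1} = \beta_{k+2} = \dots = \beta_{k+q} = 0\] \[H_a: \text{at least one of } \beta_{k+1}, \dots, \beta_{k+q} \neq 0\]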

The F test statistic for nested regression models is calculated by:

\[F = \frac{(SSE_\text{restricted} - SSE_\text{full})/df_1}{SSE_\text{full}/df_2} \] where \(df_1\) is the number of additional predictors added in the full model and \(df_2\) is the residual df for the full model, which equals \((n - 1 - \text{number of IVs in the full model})\). The \(df\) of the F test statistic is \((df_1, df_2)\).

For example, according to the equation, we can hand-calculate the F value for m4 vs m5:

# SSE_restricted:
sse_m4 <- sum(m4$residuals^2)

# SSE_full:
sse_m5 <- sum(m5$residuals^2)

# We add one additional IV, so:
df1 = 1

# Residual df for the full model (m5):
df2 = m5$df.residual

# Calculate F:
F_stats = ((sse_m4 - sse_m5)/df1)/(sse_m5/df2)
F_stats
## [1] 246.5851
# Check the right-tail probability using `1 - pf()`
# (equivalently: pf(F_stats, df1, df2, lower.tail = FALSE))
1 - pf(F_stats, df1, df2) 
## [1] 0

You can also use anova() to perform an F-test in R.

anova(m4, m5)
# F's p-value is significant: the additional IV in m5 has a non-zero effect

Part 4 Exercise

  1. (You can use pencil and paper) Show that \[\frac{(SSE_\text{restricted} - SSE_\text{full})/df_1}{SSE_\text{full}/df_2} = \frac{(R^2_\text{full} - R^2_\text{restricted})/df_1}{(1 - R^2_\text{full})/df_2}\]

  2. Create a new variable age_sq in earnings_df that is the square of age_recode. Estimate Model 6: earn ~ age_recode + edu + female + black + other + edu*female + age_sq

Then, perform an F-test between m5 and m6. What are your null and alternative hypotheses? What is your decision based on the F-test?

# Your code here

Part 5: GitHub: Getting Started


Part 5 Exercise

  1. Sign up for GitHub (your NYU email can get you a “pro” GitHub account).

  2. Download GitHub Desktop and sign in with your GitHub account.

  3. In your GitHub account, create a repository named “lab6_git_demo”. You can make it either “private” or “public”.

  4. Sync this repo to a local folder using GitHub Desktop (I recommend creating a designated “git_repos” folder and syncing it there).

  5. Copy all lab6-related files to your repo folder. Then, commit and push the changes to your GitHub repo.

  6. Check your “lab6_git_demo” repo page on GitHub. Does it have all the files updated?

  7. Delete some code of your choice in your lab6 coding file, then commit and push the change to your GitHub repo.

  8. Check your “lab6_git_demo” repo page on GitHub and click your commit ID. What do you see?
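For reference, the GitHub Desktop steps above map onto the following command-line git operations (a sketch: `<your-username>` is a placeholder for your GitHub username, and your default branch may be named `main` or `master`):

```shell
# Clone the repo you created on GitHub to a local folder
git clone https://github.com/<your-username>/lab6_git_demo.git
cd lab6_git_demo

# After copying your lab6 files into the folder: stage, commit, and push
git add .
git commit -m "Add lab6 files"
git push origin main   # use "master" if that is your default branch
```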